Version: v1.3.0

Instacart Market Basket Analysis: Leveraging Xplainable Classifier for Transparent Recommendations

Introduction

Welcome to our comprehensive walkthrough of the Instacart Market Basket Analysis Challenge on Kaggle. This challenge presents an opportunity for data enthusiasts to dive deep into the world of grocery shopping and unravel the patterns behind consumer purchase behavior. Instacart, a prominent online grocery delivery platform, has provided a rich dataset that includes a information on customer orders through time. The objective? To predict which previously purchased products will be in a user's next order.

As data scientists and technical experts, we understand the criticality of not just making accurate predictions but also being able to interpret and explain our models. This is where our novel approach comes into play. We're introducing the Xplainable algorithm - a novel machine learning algorithm designed to enhance the transparency and interpretability of the recommendations.

Enhancing Brand Trust through Two-Way Transparency

An integral aspect of our approach with the Xplainable algorithm is fostering two-way transparency between the recommender system and the customer. In traditional recommender systems, users are often left wondering why certain products are recommended to them. Our method provides the opportunity to bridge this gap by incorporating an explanatory feature, such as "users like yourself also purchase."

The Importance of Relatable Recommendations

By providing context such as "users like yourself also purchase," we achieve several key objectives:

Enhanced User Engagement: When users understand the rationale behind recommendations, they are more likely to explore and accept these suggestions.
Increased Personalisation: This approach reflects a deeper understanding of user behaviour and preferences, leading to more personalised shopping experiences.
Greater Brand Trust and Loyalty: Transparency in recommendations fosters trust. When users feel that their needs are understood and catered to, it enhances their loyalty to the brand.
Feedback Loop for Continuous Improvement: Such transparent systems encourage user feedback, providing valuable insights that can be used to further refine and improve the recommender system.

By implementing a two-way transparent model, we not only elevate the accuracy of our predictions but also enrich the user experience, instilling a sense of trust and reliability in the brand. In this walkthrough, we will explore how the Xplainable algorithm not only achieves high accuracy in predicting the next set of products for Instacart users but also provides clear insights into the 'why' behind its predictions.

import xplainable as xp
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
color = sns.color_palette()
import warnings
warnings.filterwarnings('ignore')

# Garbage Collector to free up memory
import gc                         
gc.enable()   

print(f"This notebook was created using Xplainable version {xp.__version__}")

Out:

This notebook was created using Xplainable version 1.2.3

Load the datasets

It's possible to download the Instacart price prediction dataset at the following link: https://www.kaggle.com/competitions/instacart-market-basket-analysis/data

Following extraction of the .zip file build the dataset as below:

# Load the datasets
orders = pd.read_csv('./data/orders.csv' )
order_products_train = pd.read_csv('./data/order_products__train.csv')
order_products_prior = pd.read_csv('./data/order_products__prior.csv')
products = pd.read_csv('./data/products.csv')
aisles = pd.read_csv('./data/aisles.csv')
departments = pd.read_csv('./data/departments.csv')

Inspecting orders dataset

# Get heads for the orders dataset
orders.head()

	order_id	user_id	eval_set	order_number	order_dow	order_hour_of_day	days_since_prior_order
0	2539329	1	prior	1	2	8	nan
1	2398795	1	prior	2	3	7	15
2	473747	1	prior	3	3	12	21
3	2254736	1	prior	4	4	7	29
4	431534	1	prior	5	4	15	28

# Get shape of orders dataset
orders.shape

Out:

(3421083, 7)

# Get a brief descriptive info on orders
orders.info()

Out:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3421083 entries, 0 to 3421082
Data columns (total 7 columns):
#   Column                  Dtype  
---  ------                  -----  
0   order_id                int64  
1   user_id                 int64  
2   eval_set                object 
3   order_number            int64  
4   order_dow               int64  
5   order_hour_of_day       int64  
6   days_since_prior_order  float64
dtypes: float64(1), int64(5), object(1)
memory usage: 182.7+ MB

# Get missing values in orders dataset
orders.isnull().sum()

Out:

order_id 0

user_id 0

eval_set 0

order_number 0

order_dow 0

order_hour_of_day 0

days_since_prior_order 206209

dtype: int64

We have observed that there are 206209 missing values in days_since_prior_order column.

Inspecting order_products_train dataset

# Get heads for the order_products_train dataset
order_products_train.head()

	order_id	product_id	add_to_cart_order	reordered
0	1	49302	1	1
1	1	11109	2	1
2	1	10246	3	0
3	1	49683	4	0
4	1	43633	5	1

# Get shape of order_products_train dataset
order_products_train.shape

Out:

(1384617, 4)

# Get missing values in order_products_train dataset
order_products_train.isnull().sum()

Out:

order_id 0

product_id 0

add_to_cart_order 0

reordered 0

dtype: int64

Inspecting order_products_prior dataset

# Get head for order_products_prior
order_products_prior.head()

	order_id	product_id	add_to_cart_order	reordered
0	2	33120	1	1
1	2	28985	2	1
2	2	9327	3	0
3	2	45918	4	1
4	2	30035	5	0

# Get shape for order_products_prior
order_products_prior.shape

Out:

(32434489, 4)

# Get missing value for order_products_prior
order_products_prior.isnull().sum()

Out:

order_id 0

product_id 0

add_to_cart_order 0

reordered 0

dtype: int64

Inspect products dataset

# Get heads for the products dataset
products.head()

	product_id	product_name	aisle_id	department_id
0	1	Chocolate Sandwich Cookies	61	19
1	2	All-Seasons Salt	104	13
2	3	Robust Golden Unsweetened Oolong Tea	94	7
3	4	Smart Ones Classic Favorites Mini Rigatoni Wit...	38	1
4	5	Green Chile Anytime Sauce	5	13

#  Get shape for products
products.shape

Out:

(49688, 4)

# Get missing value for products
products.isnull().sum()

Out:

product_id 0

product_name 0

aisle_id 0

department_id 0

dtype: int64

Inspecting aisles dataset

# Get head for aisles
aisles.head()

	aisle_id	aisle
0	1	prepared soups salads
1	2	specialty cheeses
2	3	energy granola bars
3	4	instant foods
4	5	marinades meat preparation

# Get shape for aisles
aisles.shape

Out:

(134, 2)

# Check missing values in aisle
aisles.isnull().sum()

Out:

aisle_id 0

aisle 0

dtype: int64

Inspecting departments dataset

# Get head for departments
departments.head()

	department_id	department
0	1	frozen
1	2	other
2	3	bakery
3	4	produce
4	5	alcohol

# Get shape for departments
departments.shape

Out:

(21, 2)

# Check missing values in departments
departments.isnull().sum()

Out:

department_id 0

department 0

dtype: int64

Exploratory Data Analysis (EDA)

# Get the number of orders in each days of a week
plt.figure(figsize=(6,4))
sns.countplot(x="order_dow", data=orders, color=color[0])
plt.ylabel('Count', fontsize=12)
plt.xlabel('Day of week', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("Orders by week day", fontsize=15)
plt.show()

The number of orders on weekends is more compared to weekdays as people stay at home and might have wanted to enjoy the foods.

# Get the number of orders for each hour in a day
plt.figure(figsize=(6,4))
sns.countplot(x="order_hour_of_day", data=orders, color=color[0])
plt.ylabel('Count', fontsize=12)
plt.xlabel('Hour of day', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("Orders by Hour of day", fontsize=15)
plt.show()

Peak hours wheremaximum orders are done is between 9 AM- 5PM. Less orders are placed before 7AM and after 11 PM.

# Analyse how people orders a new one since last order
plt.figure(figsize=(10,6))
sns.countplot(orders['days_since_prior_order'])
plt.xticks(rotation=90)
plt.show()

Maximum number of users order again after 1 month. People also order after a week and this forms the second largest order habit.

# Merge products and departments dataframes and then merging with aisles
products_details = pd.merge(left=products,right=departments,how="left")
products_details = pd.merge(left=products_details,right=aisles,how="left")
products_details.head()

	product_id	product_name	aisle_id	department_id	department	aisle
0	1	Chocolate Sandwich Cookies	61	19	snacks	cookies cakes
1	2	All-Seasons Salt	104	13	pantry	spices seasonings
2	3	Robust Golden Unsweetened Oolong Tea	94	7	beverages	tea
3	4	Smart Ones Classic Favorites Mini Rigatoni Wit...	38	1	frozen	frozen meals
4	5	Green Chile Anytime Sauce	5	13	pantry	marinades meat preparation

# Get the number of products in each department
plt.figure(figsize=(10,6))
g=sns.countplot(x="department",data=products_details)
g.set_xticklabels(g.get_xticklabels(), rotation=40, ha="right")
plt.show()

Personal care is the most abundant type of department available followed by snacks.

# Get top 10 aisle with most number of products
plt.figure(figsize=(10,6))
top10_aisle=products_details["aisle"].value_counts()[:10].plot(kind="bar",title='Aisles')

missing is the aisle with most products available.

# Merge order_products_train and products dataframes
order_products_name_train = pd.merge(left=order_products_train,right=products.loc[:,["product_id","product_name"]],on="product_id",how="left")

# Get the top 10 common products which are most bought by the customers
common_Products=order_products_name_train[order_products_name_train.reordered == 1]["product_name"].value_counts().to_frame().reset_index()
plt.figure(figsize=(12,7))
plt.xticks(rotation=90)
sns.barplot(x="product_name", y="index", data=common_Products.head(10))
plt.ylabel('product_name', fontsize=12)
plt.xlabel('count', fontsize=12)
plt.show()

Banana is the most common type of product bought by people followed by Bag of organic banana.

# Merge order_products_name_train and products_details dataframes
order_products_name_train = pd.merge(left=order_products_name_train,right=products_details.loc[:,["product_id","aisle","department"]],on="product_id",how="left")

#  Get the aisles which have most number of sales
common_aisle=order_products_name_train["aisle"].value_counts().to_frame().reset_index()
plt.figure(figsize=(12,7))
plt.xticks(rotation=90)
sns.barplot(x="aisle", y="index", data=common_aisle.head(10),palette="Blues_d")
plt.ylabel('aisle', fontsize=12)
plt.xlabel('count', fontsize=12)
plt.show()

Fresh vegetable aisle has the highest number of sales followed by fresh_fruits.

#  Get the departments which have most number of sales
common_aisle=order_products_name_train["department"].value_counts().to_frame().reset_index()
plt.figure(figsize=(12,7))
plt.xticks(rotation=90)
sns.barplot(x="department", y="index", data=common_aisle,palette="Blues_d")
plt.ylabel('department', fontsize=12)
plt.xlabel('count', fontsize=12)
plt.show()

produce and dairy eggs are the top 2 departments with the highest number of sales.

# Get the products which were reordered in each order from the train data.
train_data_reordered = order_products_train.groupby(["order_id","reordered"])["product_id"].apply(list).reset_index()
train_data_reordered = train_data_reordered[train_data_reordered.reordered == 1].drop(columns=["reordered"]).reset_index(drop=True)
train_data_reordered.head()

	order_id	product_id
0	1	[49302, 11109, 43633, 22035]
1	36	[19660, 43086, 46620, 34497, 48679, 46979]
2	38	[21616]
3	96	[20574, 40706, 27966, 24489, 39275]
4	98	[8859, 19731, 43654, 13176, 4357, 37664, 34065...

Feature Engineering

# Delete all unnncessary dataframes
del products_details
del order_products_name_train
del common_Products
del common_aisle
del train_data_reordered
gc.collect()

Out:

# Get 15% of users as it is a huge dataset and will take lots of time to train.
#orders = orders.loc[orders.user_id.isin(orders.user_id.drop_duplicates().sample(frac=0.15, random_state=101))] 

# Convert character variables into category. 
aisles['aisle'] = aisles['aisle'].astype('category')
departments['department'] = departments['department'].astype('category')
orders['eval_set'] = orders['eval_set'].astype('category')
products['product_name'] = products['product_name'].astype('category')

# Merge orders and order_products_prior datasets to get prior order dataset
prior_orders = pd.merge(orders, order_products_prior, on='order_id', how='inner')
prior_orders.head()

	order_id	user_id	eval_set	order_number	order_dow	order_hour_of_day	days_since_prior_order	product_id	add_to_cart_order
0	2539329	1	prior	1	2	8	nan	196	1
1	2539329	1	prior	1	2	8	nan	14084	2
2	2539329	1	prior	1	2	8	nan	12427	3
3	2539329	1	prior	1	2	8	nan	26088	4
4	2539329	1	prior	1	2	8	nan	26405	5

Create Features using user_id

# Create a feature based on number of orders placed by each user.
users = prior_orders.groupby(by='user_id')['order_number'].aggregate('max').to_frame('num_of_orders_for_each_user').reset_index()
users.head()

	user_id	num_of_orders_for_each_user
0	1	10
1	2	14
2	3	12
3	4	5
4	5	4

# Get average number of products in  each orders placed by each users.

# Get the total number of products in each order.
toal_product_per_order = prior_orders.groupby(by=['user_id', 'order_id'])['product_id'].aggregate('count').to_frame('total_products_per_order').reset_index()

# Create a dataframe to get the average number of products purchased in  each orders by each user
avg_number_of_products_per_order = toal_product_per_order.groupby(by=['user_id'])['total_products_per_order'].mean().to_frame('avg_no_prd_per_order').reset_index()

# Delete unnecessay the toal_product_per_order dataframe
del [toal_product_per_order]
gc.collect()

#  Get head of avg_number_of_products_per_order
avg_number_of_products_per_order.head()

	user_id	avg_no_prd_per_order
0	1	5.9
1	2	13.9286
2	3	7.33333
3	4	3.6
4	5	9.25

from scipy import stats
import pandas as pd
import numpy as np

def calculate_mode(x):
    if len(x) > 0:
        mode_result = stats.mode(x)
        # Check if mode_result.mode is an array and has at least one element
        if isinstance(mode_result.mode, np.ndarray) and mode_result.mode.size > 0:
            return mode_result.mode[0]
        else:
            return mode_result.mode
    else:
        return pd.NA

# Create a dataframe for the day of the week where users order most
order_most_dow = prior_orders.groupby(by=['user_id'])['order_dow'].aggregate(calculate_mode).to_frame('dow_with_most_orders').reset_index()

# Get head of the dataset
order_most_dow.head()

	user_id	dow_with_most_orders
0	1	4
1	2	2
2	3	0
3	4	4
4	5	3

def calculate_mode_hour(x):
    if len(x) > 0:
        mode_result = stats.mode(x)
        # Check if mode_result.mode is an array and has at least one element
        if isinstance(mode_result.mode, np.ndarray) and mode_result.mode.size > 0:
            return mode_result.mode[0]
        else:
            return mode_result.mode
    else:
        return pd.NA

# Create a dataframe for the hour of day where users have ordered most
order_most_hod = prior_orders.groupby(by=['user_id'])['order_hour_of_day'].aggregate(calculate_mode_hour).to_frame('hod_with_most_orders').reset_index()

# Display the first few rows of the dataframe
order_most_hod.head()

	user_id	hod_with_most_orders
0	1	7
1	2	9
2	3	16
3	4	15
4	5	18

#Get a dataframe for reorder ratio of each user and set data type as float
user_reorder_ratio = prior_orders.groupby(by='user_id')['reordered'].aggregate('mean').to_frame('reorder_ratio').reset_index()
user_reorder_ratio['reorder_ratio'] = user_reorder_ratio['reorder_ratio'].astype(np.float16)
user_reorder_ratio.head()

	user_id	reorder_ratio
0	1	0.694824
1	2	0.476807
2	3	0.625
3	4	0.055542
4	5	0.378418

# Merging all the created user based features into the users dataset one by one.
users = users.merge(avg_number_of_products_per_order, on='user_id', how='left')
users = users.merge(order_most_dow, on='user_id', how='left')
users = users.merge(order_most_hod, on='user_id', how='left')
users = users.merge(user_reorder_ratio, on='user_id', how='left')

users.head()

	user_id	num_of_orders_for_each_user	avg_no_prd_per_order_x	avg_no_prd_per_order_y	dow_with_most_orders	hod_with_most_orders	reorder_ratio
0	1	10	5.9	5.9	4	7	0.694824
1	2	14	13.9286	13.9286	2	9	0.476807
2	3	12	7.33333	7.33333	0	16	0.625
3	4	5	3.6	3.6	4	15	0.055542
4	5	4	9.25	9.25	3	18	0.378418

# Delete unnecessay dataframes
del [avg_number_of_products_per_order,order_most_dow,order_most_hod,user_reorder_ratio]
gc.collect()

Out:

Create features using product_id.

#Get a dataframe to show the number of times a product has been purchased.
purchased_num_of_times = prior_orders.groupby(by='product_id')['order_id'].aggregate('count').to_frame('purchased_num_of_times').reset_index()
purchased_num_of_times.head()

	product_id	purchased_num_of_times
0	1	1852
1	2	90
2	3	277
3	4	329
4	5	15

#Get a dataframe for the reordered ratio for each product
product_reorder_ratio = prior_orders.groupby(by='product_id')['reordered'].aggregate('mean').to_frame('product_reorder_ratio').reset_index()
product_reorder_ratio.head()

	product_id	product_reorder_ratio
0	1	0.613391
1	2	0.133333
2	3	0.732852
3	4	0.446809
4	5	0.6

#Get a dataframe for avearage number of adding to cart for each product.
add_to_cart = prior_orders.groupby(by='product_id')['add_to_cart_order'].aggregate('mean').to_frame('product_avg_cart_addition').reset_index()
add_to_cart.head()

	product_id	product_avg_cart_addition
0	1	5.80184
1	2	9.88889
2	3	6.41516
3	4	9.5076
4	5	6.46667

# Merge all the created features based on product_id into the purchased_num_of_times dataset.
purchased_num_of_times = purchased_num_of_times.merge(product_reorder_ratio, on='product_id', how='left')
purchased_num_of_times = purchased_num_of_times.merge(add_to_cart, on='product_id', how='left')

#Delete unwanted dataframes.
del [product_reorder_ratio, add_to_cart]
gc.collect()

Out:

# Get head of purchased_num_of_times
purchased_num_of_times.head()

	product_id	purchased_num_of_times	product_reorder_ratio	product_avg_cart_addition
0	1	1852	0.613391	5.80184
1	2	90	0.133333	9.88889
2	3	277	0.732852	6.41516
3	4	329	0.446809	9.5076
4	5	15	0.6	6.46667

Creating features using user_id and product_id

#Create a user_product dataframe which shows the number of times a user have bough a product.
user_product_data = prior_orders.groupby(by=['user_id', 'product_id'])['order_id'].aggregate('count').to_frame('uxp_times_bought').reset_index()
user_product_data.head()

	user_id	product_id	uxp_times_bought
0	1	196	10
1	1	10258	9
2	1	10326	1
3	1	12427	10
4	1	13032	3

#Create a dataframe  to find a product's order number when the user has bought a product for the first time.
product_first_order_num = prior_orders.groupby(by=['user_id', 'product_id'])['order_number'].aggregate('min').to_frame('first_order_number').reset_index()
product_first_order_num.head()

	user_id	product_id	first_order_number
0	1	196	1
1	1	10258	2
2	1	10326	5
3	1	12427	1
4	1	13032	2

#Get total number of orders by each user
total_orders = prior_orders.groupby('user_id')['order_number'].max().to_frame('total_orders').reset_index()
total_orders.head()

	user_id	total_orders
0	1	10
1	2	14
2	3	12
3	4	5
4	5	4

# Merge total_orders and user_product_data dataframes to create a new dataframe user_product_df
user_product_df = pd.merge(total_orders, product_first_order_num, on='user_id', how='right')
user_product_df.head()

	user_id	total_orders	product_id	first_order_number
0	1	10	196	1
1	1	10	10258	2
2	1	10	10326	5
3	1	10	12427	1
4	1	10	13032	2

# Calculate the order range.
# The +1 includes in the difference is the first order where the product has been purchased
user_product_df['order_range'] = user_product_df['total_orders'] - user_product_df['first_order_number'] + 1
user_product_df.head()

	user_id	total_orders	product_id	first_order_number	order_range
0	1	10	196	1	10
1	1	10	10258	2	9
2	1	10	10326	5	6
3	1	10	12427	1	10
4	1	10	13032	2	9

#Create  a dataframe to show the number of times a user have bough a product.
number_of_times = prior_orders.groupby(by=['user_id', 'product_id'])['order_id'].aggregate('count').to_frame('times_bought').reset_index()
number_of_times.head()

	user_id	product_id	times_bought
0	1	196	10
1	1	10258	9
2	1	10326	1
3	1	12427	10
4	1	13032	3

# Merging number_of_times with user_product_df
uxp_ratio = pd.merge(number_of_times, user_product_df, on=['user_id', 'product_id'], how='left')
uxp_ratio.head()

	user_id	product_id	times_bought	total_orders	first_order_number	order_range
0	1	196	10	10	1	10
1	1	10258	9	10	2	9
2	1	10326	1	10	5	6
3	1	12427	10	10	1	10
4	1	13032	3	10	2	9

# Get a dataframe to calculate the reorder ratio for each product
uxp_ratio['uxp_reorder_ratio'] = uxp_ratio['times_bought'] / uxp_ratio['order_range']
uxp_ratio.head()

	user_id	product_id	times_bought	total_orders	first_order_number	order_range	uxp_reorder_ratio
0	1	196	10	10	1	10	1
1	1	10258	9	10	2	9	1
2	1	10326	1	10	5	6	0.166667
3	1	12427	10	10	1	10	1
4	1	13032	3	10	2	9	0.333333

#Drop all the unnecessary columns from uxp_ratio dataframe .
uxp_ratio.drop(['times_bought', 'total_orders', 'first_order_number', 'order_range'], axis=1, inplace=True)
uxp_ratio.head()

	user_id	product_id	uxp_reorder_ratio
0	1	196	1
1	1	10258	1
2	1	10326	0.166667
3	1	12427	1
4	1	13032	0.333333

#Merge uxp_ratio with user_product_data.
user_product_data = user_product_data.merge(uxp_ratio, on=['user_id', 'product_id'], how='left')

# Delete all unnecessay datasets
del [product_first_order_num, number_of_times,user_product_df,total_orders, uxp_ratio]
gc.collect()

Out:

# Get head for user_product_data
user_product_data.head()

	user_id	product_id	uxp_times_bought	uxp_reorder_ratio
0	1	196	10	1
1	1	10258	9	1
2	1	10326	1	0.166667
3	1	12427	10	1
4	1	13032	3	0.333333

#Create a column order_number_back to reverse the order number for each product in prior_orders dataframe
prior_orders['order_number_back'] = prior_orders.groupby(by=['user_id'])['order_number'].transform(max) - prior_orders.order_number + 1
prior_orders.head()

	order_id	user_id	eval_set	order_number	order_dow	order_hour_of_day	days_since_prior_order	product_id	add_to_cart_order	order_number_back
0	2539329	1	prior	1	2	8	nan	196	1	10
1	2539329	1	prior	1	2	8	nan	14084	2	10
2	2539329	1	prior	1	2	8	nan	12427	3	10
3	2539329	1	prior	1	2	8	nan	26088	4	10
4	2539329	1	prior	1	2	8	nan	26405	5	10

#Update the dataframe to keep only the first 3 orders from the order_number_back.
temp_df = prior_orders.loc[prior_orders.order_number_back <= 3]
temp_df.head()

	order_id	user_id	eval_set	order_number	order_dow	order_hour_of_day	days_since_prior_order	product_id	add_to_cart_order	reordered	order_number_back
38	3108588	1	prior	8	1	14	14	12427	1	1	3
39	3108588	1	prior	8	1	14	14	196	2	1	3
40	3108588	1	prior	8	1	14	14	10258	3	1	3
41	3108588	1	prior	8	1	14	14	25133	4	1	3
42	3108588	1	prior	8	1	14	14	46149	5	0	3

#Get the products bought by users in the last 3 orders.
last_three_order = temp_df.groupby(by=['user_id', 'product_id'])['order_id'].aggregate('count').to_frame('uxp_last_three').reset_index()
last_three_order.head()

	user_id	product_id	uxp_last_three
0	1	196	3
1	1	10258	3
2	1	12427	3
3	1	13032	1
4	1	25133	3

#Get the ratio of the products bought in the last 3 orders.
last_three_order['uxp_ratio_last_three'] = last_three_order['uxp_last_three'] / 3
last_three_order.head()

	user_id	product_id	uxp_last_three	uxp_ratio_last_three
0	1	196	3	1
1	1	10258	3	1
2	1	12427	3	1
3	1	13032	1	0.333333
4	1	25133	3	1

#Merge last_three_order with feature with uxp df.
user_product_data = user_product_data.merge(last_three_order, on=['user_id', 'product_id'], how='left')

# Delete unwanted dataframes
del [last_three_order, temp_df]
gc.collect()

Out:

# Get the head of updated dataframe user_product_data.head()
user_product_data.head().head()

	user_id	product_id	uxp_times_bought	uxp_reorder_ratio	uxp_last_three	uxp_ratio_last_three
0	1	196	10	1	3	1
1	1	10258	9	1	3	1
2	1	10326	1	0.166667	nan	nan
3	1	12427	10	1	3	1
4	1	13032	3	0.333333	1	0.333333

# Check any missing values in user_product_data columns
user_product_data.isnull().sum()

Out:

user_id 0

product_id 0

uxp_times_bought 0

uxp_reorder_ratio 0

uxp_last_three 8382738

uxp_ratio_last_three 8382738

dtype: int64

#Fill the NAN values with 0 in user_product_data.
user_product_data.fillna(0, inplace=True)

# Confirm filling of missing values in user_product_data columns
user_product_data.isnull().sum()

Out:

user_id 0

product_id 0

uxp_times_bought 0

uxp_reorder_ratio 0

uxp_last_three 0

uxp_ratio_last_three 0

dtype: int64

Create final dataframe for engineered features

#  Merge user_product_data and users, then resulting dataframe with purchased_num_of_times to get featured_engineered_data dataset
featured_engineered_data = user_product_data.merge(users, on='user_id', how='left')
featured_engineered_data = featured_engineered_data.merge(purchased_num_of_times, on='product_id', how='left')

# Delete unncessary dataframes.
del [users, user_product_data, purchased_num_of_times]
gc.collect()

# Get head of featured_engineered_data
featured_engineered_data.head()

	user_id	product_id	uxp_times_bought	uxp_reorder_ratio	uxp_last_three	uxp_ratio_last_three	num_of_orders_for_each_user	avg_no_prd_per_order_x	avg_no_prd_per_order_y	dow_with_most_orders	hod_with_most_orders	reorder_ratio	purchased_num_of_times	product_reorder_ratio	product_avg_cart_addition
0	1	196	10	1	3	1	10	5.9	5.9	4	7	0.694824	35791	0.77648	3.72177
1	1	10258	9	1	3	1	10	5.9	5.9	4	7	0.694824	1946	0.713772	4.27749
2	1	10326	1	0.166667	0	0	10	5.9	5.9	4	7	0.694824	5526	0.652009	4.1911
3	1	12427	10	1	3	1	10	5.9	5.9	4	7	0.694824	6476	0.740735	4.76004
4	1	13032	3	0.333333	1	0.333333	10	5.9	5.9	4	7	0.694824	3751	0.657158	5.62277

# Check if any missing values found in featured_engineered_data dataframe columns
featured_engineered_data.isnull().sum()

Out:

user_id 0

product_id 0

uxp_times_bought 0

uxp_reorder_ratio 0

uxp_last_three 0

uxp_ratio_last_three 0

num_of_orders_for_each_user 0

avg_no_prd_per_order_x 0

avg_no_prd_per_order_y 0

dow_with_most_orders 0

hod_with_most_orders 0

reorder_ratio 0

purchased_num_of_times 0

product_reorder_ratio 0

product_avg_cart_addition 0

dtype: int64

Creating Train and Test datasets

Create training dataset

# Keep only the future orders from all customers i.e. train  and test orders
orders_future = orders[((orders.eval_set=='train') | (orders.eval_set=='test'))]
orders_future = orders_future[['user_id', 'eval_set', 'order_id']]

# merge the orders_future with featured_engineered_data to create a final dataframe.
final_data = featured_engineered_data.merge(orders_future, on='user_id', how='left')
final_data.head()

	user_id	product_id	uxp_times_bought	uxp_reorder_ratio	uxp_last_three	uxp_ratio_last_three	num_of_orders_for_each_user	avg_no_prd_per_order_x	avg_no_prd_per_order_y	dow_with_most_orders	hod_with_most_orders	reorder_ratio	purchased_num_of_times	product_reorder_ratio	product_avg_cart_addition	eval_set	order_id
0	1	196	10	1	3	1	10	5.9	5.9	4	7	0.694824	35791	0.77648	3.72177	train	1187899
1	1	10258	9	1	3	1	10	5.9	5.9	4	7	0.694824	1946	0.713772	4.27749	train	1187899
2	1	10326	1	0.166667	0	0	10	5.9	5.9	4	7	0.694824	5526	0.652009	4.1911	train	1187899
3	1	12427	10	1	3	1	10	5.9	5.9	4	7	0.694824	6476	0.740735	4.76004	train	1187899
4	1	13032	3	0.333333	1	0.333333	10	5.9	5.9	4	7	0.694824	3751	0.657158	5.62277	train	1187899

#Create training data set.
train_data = final_data[final_data.eval_set=='train']
train_data.head()

	user_id	product_id	uxp_times_bought	uxp_reorder_ratio	uxp_last_three	uxp_ratio_last_three	num_of_orders_for_each_user	avg_no_prd_per_order_x	avg_no_prd_per_order_y	dow_with_most_orders	hod_with_most_orders	reorder_ratio	purchased_num_of_times	product_reorder_ratio	product_avg_cart_addition	eval_set	order_id
0	1	196	10	1	3	1	10	5.9	5.9	4	7	0.694824	35791	0.77648	3.72177	train	1187899
1	1	10258	9	1	3	1	10	5.9	5.9	4	7	0.694824	1946	0.713772	4.27749	train	1187899
2	1	10326	1	0.166667	0	0	10	5.9	5.9	4	7	0.694824	5526	0.652009	4.1911	train	1187899
3	1	12427	10	1	3	1	10	5.9	5.9	4	7	0.694824	6476	0.740735	4.76004	train	1187899
4	1	13032	3	0.333333	1	0.333333	10	5.9	5.9	4	7	0.694824	3751	0.657158	5.62277	train	1187899

#Merge order_products__train  with into train_data datframe.
train_data = train_data.merge(order_products_train[['product_id', 'order_id', 'reordered']], on=['product_id', 'order_id'], how='left')
train_data.head()

	user_id	product_id	uxp_times_bought	uxp_reorder_ratio	uxp_last_three	uxp_ratio_last_three	num_of_orders_for_each_user	avg_no_prd_per_order_x	avg_no_prd_per_order_y	dow_with_most_orders	hod_with_most_orders	reorder_ratio	purchased_num_of_times	product_reorder_ratio	product_avg_cart_addition	eval_set	order_id	reordered
0	1	196	10	1	3	1	10	5.9	5.9	4	7	0.694824	35791	0.77648	3.72177	train	1187899	1
1	1	10258	9	1	3	1	10	5.9	5.9	4	7	0.694824	1946	0.713772	4.27749	train	1187899	1
2	1	10326	1	0.166667	0	0	10	5.9	5.9	4	7	0.694824	5526	0.652009	4.1911	train	1187899	nan
3	1	12427	10	1	3	1	10	5.9	5.9	4	7	0.694824	6476	0.740735	4.76004	train	1187899	nan
4	1	13032	3	0.333333	1	0.333333	10	5.9	5.9	4	7	0.694824	3751	0.657158	5.62277	train	1187899	1

# Check if any missing values found in train_data dataframe columns
train_data.isnull().sum()

Out:

user_id 0

product_id 0

uxp_times_bought 0

uxp_reorder_ratio 0

uxp_last_three 0

uxp_ratio_last_three 0

num_of_orders_for_each_user 0

avg_no_prd_per_order_x 0

avg_no_prd_per_order_y 0

dow_with_most_orders 0

hod_with_most_orders 0

reorder_ratio 0

purchased_num_of_times 0

product_reorder_ratio 0

product_avg_cart_addition 0

eval_set 0

order_id 0

reordered 7645837

dtype: int64

# Fill the missing values in reordered column with 0
train_data['reordered'] = train_data['reordered'].fillna(0)
train_data.head()

	user_id	product_id	uxp_times_bought	uxp_reorder_ratio	uxp_last_three	uxp_ratio_last_three	num_of_orders_for_each_user	avg_no_prd_per_order_x	avg_no_prd_per_order_y	dow_with_most_orders	hod_with_most_orders	reorder_ratio	purchased_num_of_times	product_reorder_ratio	product_avg_cart_addition	eval_set	order_id	reordered
0	1	196	10	1	3	1	10	5.9	5.9	4	7	0.694824	35791	0.77648	3.72177	train	1187899	1
1	1	10258	9	1	3	1	10	5.9	5.9	4	7	0.694824	1946	0.713772	4.27749	train	1187899	1
2	1	10326	1	0.166667	0	0	10	5.9	5.9	4	7	0.694824	5526	0.652009	4.1911	train	1187899	0
3	1	12427	10	1	3	1	10	5.9	5.9	4	7	0.694824	6476	0.740735	4.76004	train	1187899	0
4	1	13032	3	0.333333	1	0.333333	10	5.9	5.9	4	7	0.694824	3751	0.657158	5.62277	train	1187899	1

# Set user_id and product_id as the index of the train_data
train_data = train_data.set_index(['user_id', 'product_id'])

# Drop unwanted columns from train_data dataframe
train_data = train_data.drop(['eval_set', 'order_id'], axis=1)

# Get head of train_data
train_data.head()

		uxp_times_bought_Unnamed: 2_level_1	uxp_reorder_ratio_Unnamed: 3_level_1	uxp_last_three_Unnamed: 4_level_1	uxp_ratio_last_three_Unnamed: 5_level_1	num_of_orders_for_each_user_Unnamed: 6_level_1	avg_no_prd_per_order_x_Unnamed: 7_level_1	avg_no_prd_per_order_y_Unnamed: 8_level_1	dow_with_most_orders_Unnamed: 9_level_1	hod_with_most_orders_Unnamed: 10_level_1	reorder_ratio_Unnamed: 11_level_1	purchased_num_of_times_Unnamed: 12_level_1	product_reorder_ratio_Unnamed: 13_level_1	product_avg_cart_addition_Unnamed: 14_level_1	reordered_Unnamed: 15_level_1
1	196	10	1	3	1	10	5.9	5.9	4	7	0.694824	35791	0.77648	3.72177	1
1	10258	9	1	3	1	10	5.9	5.9	4	7	0.694824	1946	0.713772	4.27749	1
1	10326	1	0.166667	0	0	10	5.9	5.9	4	7	0.694824	5526	0.652009	4.1911	0
1	12427	10	1	3	1	10	5.9	5.9	4	7	0.694824	6476	0.740735	4.76004	0
1	13032	3	0.333333	1	0.333333	10	5.9	5.9	4	7	0.694824	3751	0.657158	5.62277	1

Create testing dataset

#Keep only the future orders labelled as test
test_data = final_data[final_data.eval_set=='test']
test_data.head()

	user_id	product_id	uxp_times_bought	uxp_reorder_ratio	uxp_last_three	uxp_ratio_last_three	num_of_orders_for_each_user	avg_no_prd_per_order_x	avg_no_prd_per_order_y	hod_with_most_orders	reorder_ratio	purchased_num_of_times	product_reorder_ratio	product_avg_cart_addition	eval_set	order_id
120	3	248	1	0.090909	0	0	12	7.33333	7.33333	16	0.625	6371	0.400251	10.6208	test	2774568
121	3	1005	1	0.333333	1	0.333333	12	7.33333	7.33333	16	0.625	463	0.440605	9.49892	test	2774568
122	3	1819	3	0.333333	0	0	12	7.33333	7.33333	16	0.625	2424	0.492162	9.28754	test	2774568
123	3	7503	1	0.1	0	0	12	7.33333	7.33333	16	0.625	12474	0.553551	9.54738	test	2774568
124	3	8021	1	0.090909	0	0	12	7.33333	7.33333	16	0.625	27864	0.591157	8.82285	test	2774568

# Set user_id and product_id as the index of the train_data
test_data = test_data.set_index(['user_id', 'product_id'])

# Drop unwanted columns from train_data dataframe
test_data = test_data.drop(['eval_set', 'order_id'], axis=1)

# Get head of test_data 
test_data.head()

		uxp_times_bought_Unnamed: 2_level_1	uxp_reorder_ratio_Unnamed: 3_level_1	uxp_last_three_Unnamed: 4_level_1	uxp_ratio_last_three_Unnamed: 5_level_1	num_of_orders_for_each_user_Unnamed: 6_level_1	avg_no_prd_per_order_x_Unnamed: 7_level_1	avg_no_prd_per_order_y_Unnamed: 8_level_1	hod_with_most_orders_Unnamed: 10_level_1	reorder_ratio_Unnamed: 11_level_1	purchased_num_of_times_Unnamed: 12_level_1	product_reorder_ratio_Unnamed: 13_level_1	product_avg_cart_addition_Unnamed: 14_level_1
3	248	1	0.090909	0	0	12	7.33333	7.33333	16	0.625	6371	0.400251	10.6208
3	1005	1	0.333333	1	0.333333	12	7.33333	7.33333	16	0.625	463	0.440605	9.49892
3	1819	3	0.333333	0	0	12	7.33333	7.33333	16	0.625	2424	0.492162	9.28754
3	7503	1	0.1	0	0	12	7.33333	7.33333	16	0.625	12474	0.553551	9.54738
3	8021	1	0.090909	0	0	12	7.33333	7.33333	16	0.625	27864	0.591157	8.82285

#Delete unnecessay dataframes
del [final_data, orders_future, products, order_products_train]
gc.collect()

Out:

Building model using Xplainable Classifier

Build X_train and y_train dataset

# Define X_train and y_train
X_train, y_train = train_data.drop('reordered', axis=1), train_data.reordered

Optimise on 1,000,000 rows

from xplainable.core.optimisation.bayesian import XParamOptimiser

opt = XParamOptimiser()
params = opt.optimise(X_train[:1000000], y_train[:1000000])

Out:

100%|████████| 30/30 [00:45<00:00,  1.53s/trial, best loss: -0.8107380819617527]

Use params from XParamOptimiser() to fit the xplainable classifier

from xplainable.core.models import XClassifier

model = XClassifier(**params)
model.fit(X_train, y_train)

Out:

<xplainable.core.ml.classification.XClassifier at 0x2a426f760>

Model Explanations for Item Recommender

Plot model explanations using the .explain() method

model.explain()

In the Feature Importances section, we see a list of features with corresponding importance values. The feature uxp_reorder_ratio has the highest importance, indicating that it is the most influential factor in the model's predictions.

On the Contributions side, the uxp_reorder_ratio feature also shows a notable contribution to the model's output. The green bars represent positive contributions, while the red bars indicate negative contributions. The specific contribution values are again not directly visible, but the length and color of the bars suggest that uxp_reorder_ratio has a strong positive influence on the model's predictions.

Create model predictions using threshold cutoff NOTE: Adjust the threshold cutoff to see the impact on the result

# Get model predictions with threshold of 0.21 probability
test_prediction = (model.predict_proba(test_data) >= 0.21).astype(int)
test_prediction[:5]

Out:

array([0, 0, 0, 0, 0])

train_prediction = (model.predict_proba(X_train) >= 0.21).astype(int)
train_prediction[:5]

Out:

array([1, 1, 0, 1, 0])

# Import evaluation matrices
from sklearn.metrics import f1_score, classification_report

# Get f1 score and classification report
print(f'f1 Score: {f1_score(train_prediction, y_train)}')
print(classification_report(train_prediction, y_train))

Out:

f1 Score: 0.41808008442097677

precision recall f1-score support

0 0.92 0.94 0.93 7520042

1 0.45 0.39 0.42 954619

accuracy 0.88 8474661

macro avg 0.69 0.66 0.67 8474661

weighted avg 0.87 0.88 0.87 8474661

#Create the prediction as a new column in test_data
test_data['prediction'] = test_prediction
test_data.head()

		uxp_times_bought_Unnamed: 2_level_1	uxp_reorder_ratio_Unnamed: 3_level_1	uxp_last_three_Unnamed: 4_level_1	uxp_ratio_last_three_Unnamed: 5_level_1	num_of_orders_for_each_user_Unnamed: 6_level_1	avg_no_prd_per_order_x_Unnamed: 7_level_1	avg_no_prd_per_order_y_Unnamed: 8_level_1	hod_with_most_orders_Unnamed: 10_level_1	reorder_ratio_Unnamed: 11_level_1	purchased_num_of_times_Unnamed: 12_level_1	product_reorder_ratio_Unnamed: 13_level_1	product_avg_cart_addition_Unnamed: 14_level_1
3	248	1	0.090909	0	0	12	7.33333	7.33333	16	0.625	6371	0.400251	10.6208
3	1005	1	0.333333	1	0.333333	12	7.33333	7.33333	16	0.625	463	0.440605	9.49892
3	1819	3	0.333333	0	0	12	7.33333	7.33333	16	0.625	2424	0.492162	9.28754
3	7503	1	0.1	0	0	12	7.33333	7.33333	16	0.625	12474	0.553551	9.54738
3	8021	1	0.090909	0	0	12	7.33333	7.33333	16	0.625	27864	0.591157	8.82285

# Reset the index and create a dataset called final_df
final_df = test_data.reset_index()

# Keeping only the required columns to create  submission file
final_df = final_df[['product_id', 'user_id', 'prediction']]

# Collect garbage and show head of final_df
gc.collect()
final_df.head()

	product_id	user_id
0	248	3
1	1005	3
2	1819	3
3	7503	3
4	8021	3

Creating the Kaggle submission file (optional)

After developing a robust model and ensuring its performance on our validation set, the next step is to prepare our submission for Kaggle. Although this step is optional, it is a good practice to understand how to create a submission file that adheres to the competition's requirements.

To create a submission file, you typically need to:

Ensure that your model has been trained with the full training set or with an appropriate cross-validation strategy.
Generate predictions for the test set provided by Kaggle.
Format these predictions into a CSV file that matches the submission format of the competition, which usually involves setting the index to an id column and including a column with your predictions.
Use the to_csv() function from pandas with the appropriate parameters, such as index=False if the index should not be included in the submission file, to save your dataframe to a CSV file.
Upload this CSV file to the Kaggle competition's submission page to see how your model performs on the unseen test set.

See specific steps for the kaggle upload below

#Create  a new dataframe orders_test 
orders_test = orders.loc[orders.eval_set == 'test', ['user_id', 'order_id']]
orders_test.head()

	user_id	order_id
38	3	2.77457e+06
44	4	329954
53	6	1.52801e+06
96	11	1.37694e+06
102	12	1.35684e+06

#Merge  final_df with orders_test daatframe
final_df = final_df.merge(orders_test, on='user_id', how='left')
final_df.head()

	product_id	user_id	order_id
0	248	3	2.77457e+06
1	1005	3	2.77457e+06
2	1819	3	2.77457e+06
3	7503	3	2.77457e+06
4	8021	3	2.77457e+06

# Remove user_id column and convert product_id as integer
final_df = final_df.drop('user_id', axis=1)
final_df['product_id'] = final_df.product_id.astype(int)
final_df.head()

	product_id	order_id
0	248	2.77457e+06
1	1005	2.77457e+06
2	1819	2.77457e+06
3	7503	2.77457e+06
4	8021	2.77457e+06

# Create a dictionary to store product IDs whose reordered prediction value is 1
final_dict = dict()
for row in final_df.itertuples():
    if row.prediction== 1:
        try:
            final_dict[row.order_id] += ' ' + str(row.product_id)
        except:
            final_dict[row.order_id] = str(row.product_id)

# Update products whose reorder prediction value is 0 as None          
for order in final_df.order_id:
    if order not in final_dict:
        final_dict[order] = 'None'
 
# Collect garbage
gc.collect()

Out:

# Convert the final_dict dictionary into a final submission dataframe called submission_df
submission_df = pd.DataFrame.from_dict(final_dict, orient='index')

# Reset index
submission_df.reset_index(inplace=True)

#Set column names
submission_df.columns = ['order_id', 'products']

# Get head
submission_df.head()

	order_id	products
0	2774568	17668 18599 21903 39190 43961 47766
1	1528013	21903 38293
2	1376945	8309 13176 14947 20383 27959 33572 35948 44632
3	1356845	5746 7076 8239 10863 11520 13176 14992
4	2161313	196 10441 11266 12427 14715 27839 37710

# Create the final submission file
submission_df.to_csv('sub.csv', index=False, header=True)

submission_df.shape

Out:

(75000, 2)

Instacart Market Basket Analysis: Leveraging Xplainable Classifier for Transparent Recommendations

Introduction​

Enhancing Brand Trust through Two-Way Transparency​

The Importance of Relatable Recommendations​

Load the datasets

Exploratory Data Analysis (EDA)

Feature Engineering

Create Features using user_id​

Create features using product_id.​

Creating features using user_id and product_id​

Create final dataframe for engineered features​

Creating Train and Test datasets

Create training dataset​

Create testing dataset​

Building model using Xplainable Classifier

Build X_train and y_train dataset​

Optimise on 1,000,000 rows​

Model Explanations for Item Recommender​

Creating the Kaggle submission file (optional)

Introduction

Enhancing Brand Trust through Two-Way Transparency

The Importance of Relatable Recommendations

Create Features using user_id

Create features using product_id.

Creating features using user_id and product_id

Create final dataframe for engineered features

Create training dataset

Create testing dataset

Build X_train and y_train dataset

Optimise on 1,000,000 rows

Model Explanations for Item Recommender